We are interested in solving linear, or matrix, system of arbitrary shape and size,
(1) A*x= b
Where the solution obtained from naïve method is badly behaved. This happens for various reasons, such as duplicated data, or other reasons for singularity. But the interesting case is when it happens because the right hand side has more data error than the system can tolerate and get a good answer. This comes up commonly in “inverse problems”, meaning problem in which we are trying to back up diffusive process in time. For example, in solving the “inverse heat problem” we are trying to determine what the heat profile of some linear arrangement was at an earlier time. Here is a graph of that ideal original hear profile in this problem:
Here is an example “solution” obtained by naively solving the 13 row by 17 column problem (see data files accompanying this document):
The natural way of solving any hard linear system is to start with the widely available Singular Value Decomposition:
(2) A = U*S*VT
Where U*UT = I, UT*U=I, V*VT =I, VT*V=I, and S is a diagonal matrix of decreasing “singular values”. We will ignore the minor aggravation of whether U and V and the full, square matrices of theory, or a reduced shape which takes advantage of the chance to save space by reducing the sizes of U and V when there are singular values which are zero. Those matters are only of concern to the code developer.
We solve the system (1) as follows:
(3) A * x = b
(4) U * S * VT * x = b
(5) S * VT * x = UT * b
(6) VT * x = S+ * UT * b = PCV
(7) x = V * S+ * UT * b
Here S+ is the pseudo-inverse of S, meaning that 0 or negligible values of S[i] are treated instead as being infinity (∞).
This “pseudoinverse” approach only helps when the difficulty is the presence of one or more “simple” singularities due to such problems as duplicated rows. Sometimes that is enough. In statistical contexts the matrix V * S+ * UT is referred to as the “pseudo-inverse of A and seems to be important.
The next step in trying to solve (1) is to set MORE elements of S to zero (or ∞, whichever way you want to think of it). For example, if the average accuracy of the elements of A is, say, 4 significant digits, then one might set all the singular values at the end of S to zero that add together to less than 0.0001 times the average element of A, or 0.0001 times S. But this doesn’t really help in general because the problem is usually with the error in b, not the error in A.
For our purposes zero singular values are not a difficulty… we always treat them as the pseudoinverse does. The more important case is SMALL singular values.
Papers have been written for over thirty years on how to automatically resolve the problem of excessively small singular values, or “ill-conditioning” in linear equation systems. These problems are characterized as having wildly unacceptable solutions, especially when the solution is expected to be non-negative. And, even irrelevantly small changes to the observed data, b, can cause the solution to be wildly different yet, and still unacceptable.
I studied this problem many years ago as part of my Ph.D. under the well-known Dr. Cleve Moler, the inventor of MATLAB. Many papers have been published since then, with various approaches and degrees of success, and various kinds of constraints or conditions. I have presented later versions of my algorithm in C++ at appropriate conferences, and I also conferred with a Russian mathematician on one of these competitive algorithm. My present algorithm is not very “theoretical”, and not a very good candidate for publication, but it is practical, efficient, and adaptable. For example, in my C++ package I allow any number of both equality constraints and inequality constraints to be added to the problem. However, there is little general use for such matters. In the current package we offer only optional one extra constraint: that the solution be non-negative. With this perspective, let’s go on:
Our interest in dealing with this ill-conditioning is in the right hand side of (6), which we have labelled “PCV” for Picard Condition Vector. Picard’s condition states that the elements of the PCV must converge toward zero for the system to be solvable stably. When the system (1) is ill-conditioned, that fact will show here by the PCV growing inappropriately in its latter elements. This happens because the elements of UT * b drop to zero FASTER that the singular values, S[i], drop to zero. In fact, what happens is that the S[i] march steadily toward zero, as that is the nature of the singular values in these case, whereas the latter elements of UT * b remain at a significantly high level due to noise in A and/or b. Here is an example PCV for the small inverse heat problem we mentioned.
Obviously there is a real problem here! Next we zoom in to see the first few values:
All appears to behave well with the PCV for the first half dozen elements. Then something wrong happens. Below is a spread sheet of the singular values and the elements of UT * b. The PCV is computed by dividing the elements of UT * b in the second column by the singular values in the first column. That quotient is in the third column. You can see that the singular values are marching nicely to zero. But the UT * b values seem to plateau at random plus and minus values a little over 0.0001 in magnitude. Since the latter values in UT * b should – by Picard – have been two or three orders of magnitude smaller than they are, we can attribute maybe 99% of their magnitude to noise in A and b. In fact, since U (and UT) have all ones for their singular values, the latter element of UT * b are irrelevantly transformed noise samples of b. So we can take the “latter values” of UT * b to be of the same magnitude as the original error in b.
NOW… the crux of our situation: if we knew an algorithm to automatically tell WHERE to start in the second column we could automatically estimate the root-mean-square, of the error in b – it is the same as the RMS of these latter values in UT * b. That would be very helpful because it would let us then determine how much to regularize the solution process (more on this later). You will notice that the choice of where to start counting the second column estimates as error samples is not particularly critical on the high side: we consider rows 6 to 13 to be error samples; or with little change, consider rows 7 to 13 to be error samples; and in this problem we could also use 8 to 13, and even further. But often we will not have to luxury of so many rows of this table to work with. Many cases of small systems will have only a couple of error samples: so they are precious and must not be wasted. On the other hand, estimating the usable rows as 4 or 5 instead of 6 or 7 could be catastrophic to the solution as that choice would unjustifiably increase the RMS of the “error samples”. This would cause us to over-regularize the problem.
So you may be asking, if the “usable rank” of this problem is clearly around 5 to 7, why not just try setting all singular values past 5, 6, and 7 to zero and select one of those answers? For two reasons: we need an AUTOMATIC process. How would we choose between those three automatically? That is a difficult matter. And second, NONE of those three may be very good. We can do better than that, and automatically. (However, for the record, in this particular example zeroing singular value past rank 5 produces a useful answer almost as good as autorej’s result. This is equivalent to using a Generalized Inverse with a Singular Value cutoff of 0.004. But using a cutoff of 0.005 would product a bad solution.)
Before talking about our algorithm for determining “usable rank”, I want to emphasize that the above example is unusually clear. The PCV can vary drastically. I have seen it be roughly level for a few values, then steadily rise. It can oscillate between near zero and very positive and negative large values. It is very unpredictable. So our methods must be aware of the complexity and pitfalls.
We have developed for autorej two heuristic algorithms for choosing that point in the PCV at which noise starts dominating the elements of UT*b. We have tuned the parameters with many problems for many years. The first algorithm, called splita in the code, uses a pair of weighted moving averages. One lazily follows successive values of the PCV which are larger than its current value. The second lazily follows successive values of the PCV which are smaller than its current value, but also VERY lazily follows larger values of the PCV. When the first average becomes substantially larger than the second average then autorej declares that point to be the end of any possible useful data.
The second algorithm, called splitb in the code, re-analyses the PCV data, up to the upper rank limit determined by splita, using a fixed width, unweighted average of the SQUARES of the elements of the PCV. This allows a closer look at when the values get too large. The fixed width that is chosen depends on problem size but is never greater than 6. We look carefully at these sums and stop when one successive moving sum is substantially larger than the previous sum. This result from splitb becomes what we call the “usable rank”, and values beyond that in UT*b are taken to be samples of the error in the original RHS, b.
So how do we use these noise values to find a solution? Simply: the RMS (root-mean-square) of them becomes the parameter SIGMA that we use to proceed with the solution. This process of ”Tikhonov regularization” can be presented several ways, but this straightforward method seems adequate:
To the original A*x=b, we add equations asking the x values to be zero. That is, we add
(8) xi = 0.0 , for all indices i
But we weight these extra equations by a small value which is usually called lambda, or λ.
(9) λ xi = 0.0 , for all indices i
We write this enhanced system in matrix form like this:
(10) A * x = b
(11) λ * I * x = 0 ( I is an appropriately sized identity matrix)
If λ is zero, there is no change to x. But the larger lambda is, the more we will force the computed vector, x, from its nominal pseudoinverse result toward a solution of all zeros. We choose which x we want with the “discrepancy method”: that is, we search for a value of λ for which the resulting solution, x(λ), satisfies this (where x(0) is the pseudoinverse solution, for which λ is not present):
(12) RMS( x(λ) -x(0) ) = sigma
That is, we, or rather, autorej asks that the new, auto-regularized solution deviate from the unregularized solution by exactly SIGMA on average (in RMS sense).
And that does it. The “Gordian knot” has been untied.
Finally, if the user so asks, we append a non-negativity algorithm to the above solution process. This is as in the classic book, “Solving Least Squares Problems” by Lawson & Hanson (Hanson being another member of my dissertation committee) . Basically, we pick a negative result, and zero its column. Then we resolve. That value will now be zero, and the solution will have adapted to part to that change. Then you pick another to zero out. The question is, what ORDER should these be picked? Our choice is to force the most negative xi to become zero. We iterate on this until there are no non-roundoff negative solution values.
This is the resulting auto-regularized solution for the inverse heat problem above:
And this is the result with the nonnegative option:
Note that the negative values didn’t just disappear: the solution adjusted better represent the flat region in the first 10 samples. This may not look like a great solution, but in most applications involving these ill-conditioned systems, this is a fabulous automatic solution.
Return to main autorej page.